ABSTRACT

So far, privacy models follow two paradigms. The first paradigm, 
termed inferential privacy in this paper, focuses on the risk 
due to statistical inference of sensitive information about a 
target record from other records in the database. The second 
paradigm, known as differential privacy, focuses on the risk 
to an individual when included in, versus when not included 
in, the database. The contribution of this paper consists of 
two parts. The first part presents a critical analysis on dif- 
ferential privacy with two results: (i) the differential privacy 
mechanism does not provide inferential privacy, (ii) the impossibility
result about achieving Dalenius's privacy goal [5]
is based on an adversary simulated by a Turing machine, but 
a human adversary may behave differently; consequently, the 
practical implication of the impossibility result remains un- 
clear. The second part of this work is devoted to a solution 
addressing three major drawbacks in previous approaches to 
inferential privacy: lack of flexibility for handling variable 
sensitivity, poor utility, and vulnerability to auxiliary infor- 
mation. 

1. INTRODUCTION 

There has been a significant interest in the analysis of 
data sets whose individual records are too sensitive to ex- 
pose directly. Examples include medical records, financial 
data, insurance data, web query logs, user rating data for 
recommender systems, personal data from social networks, 
etc. Data of this kind provide rich information for data anal- 
ysis in a variety of important applications, but access to such 
data may pose a significant risk to individual privacy, as il- 
lustrated in the following example. 

Example 1. A hospital maintains an online database for 
answering count queries on medical data like the table T in 
Table 1. T contains three columns, Gender, Zipcode, and
Disease, where Disease is a sensitive attribute. Suppose that
an adversary tries to infer the disease of an individual Alice, 



Gender   Zipcode   Disease
M        54321     Brain Tumor
M        54322     Indigestion
F        61234     Cancer
F        61434     HIV

Table 1: A table T

with the background knowledge that Alice, a female living in 
the area with Zipcode 61434, has a record in T. The adversary 
issues the following two queries Q1 and Q2:

Q1: SELECT COUNT(*) FROM T WHERE Gender=F
AND Zipcode=61434

Q2: SELECT COUNT(*) FROM T WHERE Gender=F
AND Zipcode=61434 AND Disease=HIV

Each query returns the number of participants (records) 
who match the description in the WHERE clause. Suppose 
that the answers for Q1 and Q2 are x and y, respectively. The
adversary then estimates that Alice has HIV with probability 
y/x, and if y/x and x are "sufficiently large", there will be a 
privacy breach. 
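
The two count queries of Example 1 can be sketched in a few lines of Python. This is only an illustration of the adversary's arithmetic: the in-memory table and the helper name `count` are assumptions made for this sketch, not part of the hospital system described above.

```python
# Table T from Example 1: (Gender, Zipcode, Disease)
T = [
    ("M", "54321", "Brain Tumor"),
    ("M", "54322", "Indigestion"),
    ("F", "61234", "Cancer"),
    ("F", "61434", "HIV"),
]

def count(table, gender=None, zipcode=None, disease=None):
    """Answer a count query; None means no condition on that column."""
    return sum(
        1
        for g, z, d in table
        if (gender is None or g == gender)
        and (zipcode is None or z == zipcode)
        and (disease is None or d == disease)
    )

x = count(T, gender="F", zipcode="61434")                 # answer to Q1
y = count(T, gender="F", zipcode="61434", disease="HIV")  # answer to Q2

# The adversary's estimate that Alice has HIV (undefined when x is 0).
inference_probability = y / x if x else None
```

On this four-row table both answers are 1, so the estimate is 1.0, but x is far too small to count as "sufficiently large"; the example only becomes a breach when many records share Alice's description.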

1.1 Inferential vs Differential 

In the above example, the adversary infers that the rule 

(Gender = F ∧ Zipcode = 61434) → (Disease = HIV)

holds with the probability y/x and that Alice has HIV with 
the probability y/x, assuming that the (diseases of) records 
follow some underlying probability distribution. This type of 
reasoning, which learns information about one record from 
the statistics of other records, is found in many advanced ap- 
plications such as recommender systems, prediction models, 
viral marketing, social tagging, and social networks. The 
same technique could be misused to infer sensitive informa- 
tion about an individual like in the above example. According 
to the Privacy Act of Canada, publishing the above query an- 
swers would breach Alice's privacy because they disclose Al- 
ice's disease with a high accuracy. In this paper, inferential 
privacy refers to the requirement of limiting the statistical 
inference of sensitive information about a target record from 
other records in the database. See [1] for a list of works in
this field. 

One recent breakthrough in the study of privacy preservation
is differential privacy [5][7]. In an "impossibility result",
the authors of [5][7] showed that it is impossible to achieve
Dalenius's absolute privacy goal for statistical databases: anything
that can be learned about a respondent from the statistical
database should be learnable without access to the
database. Instead of limiting what can be learnt about one 
record from other records, the differential privacy mechanism 
hides the presence or absence of a participant in the database, 
by producing noisy query answers such that the distribu- 
tion of query answers changes very little when the database 
differs in any single record. The following definition is taken from
the differential privacy literature.

Definition 1. A randomized function K gives $\epsilon$-differential
privacy if for all data sets T and T' differing on at most one
record, for all queries Q, and for all outputs x,
$\Pr[K(T, Q) = x] \le \exp(\epsilon) \cdot \Pr[K(T', Q) = x]$.

With a small $\epsilon$, the presence or absence of an individual
is hidden because T and T' are almost equally likely to be 
the underlying database that produces the final output of 
the query. Some frequently cited claims of the differential 
privacy mechanism are that it provides privacy without any 
assumptions about the data and that it protects against ar- 
bitrary background information. But there is no free lunch in 
data privacy, as pointed out by Kifer and Machanavajjhala 
recently [14]. Their study shows that assumptions about the
data and the adversaries are required if hiding the evidence
of participation, instead of the presence/absence of records 
in the database, is the privacy goal, which they argue should 
be a major privacy definition. 

1.2 Contributions 

The contribution of this paper consists of two parts. In 
the first part, we argue that differential privacy is insufficient 
because it does not provide inferential privacy. We present 
two specific results: 

• (Section 2.1) Using a differential inference theorem, we 
show that the noisy query answers returned by the dif- 
ferential privacy mechanism may derive an inference 
probability that is arbitrarily close to the inference prob- 
ability obtained from the noise-free query answers. This 
study suggests that providing inferential privacy remains 
a meaningful research problem, despite the protection 
of differential privacy. 

• (Section 2.2) While the impossibility result in [5] is
based on an adversary simulated by a Turing machine, 
a human adversary may behave differently when evalu- 
ating the sensitivity of information. We use the Terry 
Gross example, which is a key motivation of differen- 
tial privacy, to explain this point. This study suggests 
that the practical implication of the impossibility result 
remains unclear. 

Given that inferential privacy remains relevant, the second 
part of this work is devoted to stronger solutions for infer- 
ential privacy. Previous approaches suffer from three major 
limitations. Firstly, most solutions are unable to handle sensitive
values that have a skewed distribution and varied sensitivity.
For example, with the Occupation attribute in the
Census data (Section 7) having the minimum and maximum
frequency of 0.18% and 7.5%, the maximum $\ell$-diversity [19]
that can be provided is 13-diversity because of the eligibility
requirement $1/\ell \ge 7.5\%$ [22]. Therefore, it is impossible to
protect the infrequent items at the tail of the distribution or
more sensitive items by a larger $\ell$-diversity, say 50-diversity,
whose bound of 1/50 = 2% is still more than 10 times the prior 0.18%.
Secondly, even if it is possible to achieve such $\ell$-diversity,
enforcing $\ell$-diversity with a large $\ell$ across all sensitive
values leads to a large information loss. Finally, previous solutions
are vulnerable to additional auxiliary information [21][13][17].
We address these issues in three steps.

• (Section 3) To address the first two limitations
above, we consider a sensitive attribute with domain
values $x_1, \dots, x_m$ such that each $x_i$ has a different
sensitivity and, thus, a tolerance $f'_i$ on inference probability.
We consider a bucketization problem in which buckets
of different sizes can be formed to accommodate different
requirements $f'_i$. The goal is to find a collection of
buckets for a given set of records so that a notion of
information loss related to bucket size is minimized and
the privacy constraint of all $x_i$'s is satisfied.

• (Sections 4, 5, and 7) 

We present an efficient algorithm for the case of two 
distinct bucket sizes (but many buckets) with guaran- 
teed optimality, and a heuristic algorithm for the general 
case. The empirical study on real life data sets shows 
that both solutions are good approximations of opti- 
mal solutions in the general case and better deal with 
a sensitive attribute of skewed distribution and varied 
sensitivity. 

• (Section 6) We adapt our solutions to guard against two
previously identified strong attacks, corruption attack
[21] and negative association attack [13][17] (see more
details in Section 6).



1.3 Related Work 

Limiting statistical disclosure has been a topic extensively
studied in the field of statistical databases; see [1] for a list of
works. This problem was recently examined in the context of
privacy preserving data publishing, and some representative
privacy models include $\rho_1$-$\rho_2$ privacy [9], the $\ell$-diversity
principle [19], and t-closeness [16]. All of these works assume uniform
sensitivity across all sensitive values. One exception is the
personalized privacy in [23] where a record owner can specify
his/her privacy threshold. Another exception is [21] where
each sensitive value may have a different privacy setting. To
achieve the privacy goal, these works require a taxonomy of
domain values to generalize the attributes, thus, cannot be
applied if such taxonomy is not available. The study in [22]
shows that generalized attributes are not useful for count
queries on raw values. Dealing with auxiliary information is
a hard problem in data privacy [13][17], and so far there
are few satisfactory solutions.

There has been a great deal of work in differential privacy
since the pioneer work [7][5]. This includes, among
others, contingency table releases [2], estimating the degree
distribution of social networks [11], histogram queries [12],
and the number of permissible queries [2^. These works are
concerned with applications of differential privacy in various
scenarios. Unlike previous works, the authors of [14] argue
that hiding the evidence of participation, instead of the
presence/absence of records in the database, should be a major
privacy definition, and this privacy goal cannot be achieved
without making assumptions about the data and the adversaries.

2. ANALYZING DIFFERENTIAL PRIVACY 

This section presents a critical analysis on the differential 
privacy mechanism. In Section 2.1, we show that the differential
privacy mechanism allows violation of inferential privacy.
In Section 2.2, we argue that a human adversary may behave
differently from some assumptions made in the impossibility
result of [5]; thus, the practical implication of the impossibility
result remains unclear.

2.1 On Violating Inferential Privacy 

One popularized claim of the differential privacy mecha- 
nism is that it protects an individual's information even if an 
attacker knows about all other individuals in the data. We 
quote the original discussion from [3] (p. 3):

"If there is information about a row that can be 
learned from other rows, this information is not 
truly under the control of that row. Even if the 
row in question were to sequester itself away in 
a high mountaintop cave, information about the 
row that can be gained from the analysis of other 
rows is still available to an adversary. It is for 
this reason that we focus our attention on those 
inferences that can be made about rows without 
the help of others." 

In other words, the differential privacy framework does not 
consider violation to inferential privacy and the reason is that 
it is not under the control of the target row. Two points need 
clarification. Firstly, a user submits her sensitive data to an 
organization because she trusts that the organization will do 
everything possible to protect her sensitive information; in- 
deed, the data publisher has full control in how to release the 
data or query answers in order to protect individual privacy. 
Secondly, learning information about one record from other 
records could pose a risk to an individual if the learnt information
is accurate about the individual. This type of learning
assumes that records follow some underlying probability distribution,
an assumption widely adopted by prediction models in
many real applications. Under this assumption, suppose Q1
and Q2 in Example 1 have the answers x = 100 and y = 99;
even if Alice's record is removed from the database, it is still
valid to infer that Alice has HIV with a high probability.

Next, we show that even if the differential privacy mechanism
adds noise to the answers for queries Q1 and Q2, Alice's
disease can still be inferred using the noisy answers.

Let x and y be the true answers to Q1 and Q2. We assume
that x and y are non-zero. The differential privacy mechanism
will return the noisy answers $X = x + \xi_1$ and $Y = y + \xi_2$
for Q1 and Q2, after adding noises $\xi_1$ and $\xi_2$. Consider the
most used Laplace distribution $Lap(b) = \frac{1}{2b}\exp(-|\xi|/b)$ for
the noise $\xi$, where b is the scale factor. The mean $E[\xi]$ is zero
and the variance $\mathrm{var}[\xi]$ is $2b^2$. The next theorem is a
known result for this mechanism.

Theorem 1. For a count query Q, the mechanism K
that adds independently generated noise $\xi$ with distribution
$Lap(1/\epsilon)$ to the output enjoys $\epsilon$-differential privacy.

The next theorem shows that Y/X is a good approximation
of y/x.



Theorem 2 (Differential Inference Theorem). Given
two queries Q1 and Q2 as above, let x and y be the true
answers and let X and Y be the answers returned by the
$\epsilon$-differential privacy mechanism. Then
$E[Y/X] = \frac{y}{x}\left(1 + \frac{2b^2}{x^2}\right)$ and
$\mathrm{var}[Y/X] = \frac{2b^2}{x^2}\left(1 + \left(\frac{y}{x}\right)^2\right)$, where $b = 1/\epsilon$.

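Proof. Using the Taylor expansion technique [s] [20], the
mean $E[Y/X]$ and variance $\mathrm{var}[Y/X]$ of Y/X can be approximated
as follows:

$$E\left[\frac{Y}{X}\right] \approx \frac{E[Y]}{E[X]} - \frac{\mathrm{cov}[X,Y]}{E[X]^2} + \frac{\mathrm{var}[X]\,E[Y]}{E[X]^3}$$

$$\mathrm{var}\left[\frac{Y}{X}\right] \approx \frac{\mathrm{var}[Y]}{E[X]^2} - \frac{2E[Y]}{E[X]^3}\,\mathrm{cov}[X,Y] + \frac{E[Y]^2}{E[X]^4}\,\mathrm{var}[X]$$

E[X] and E[Y] are equal to the true answers x and y of Q1
and Q2. $\mathrm{var}[X]$ and $\mathrm{var}[Y]$ are $2b^2$ for Lap(b).
$\mathrm{cov}[X,Y] = \mathrm{cov}[x + \xi_1, y + \xi_2] = \mathrm{cov}[\xi_1, \xi_2]$.
Since $\xi_1$ and $\xi_2$ are independent, $\mathrm{cov}[\xi_1, \xi_2] = 0$.
Simplifying the above equations, we get $E[Y/X]$
and $\mathrm{var}[Y/X]$ as required.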

□ 

The next corollary follows from the fact that $y/x \le 1$ and b
is a constant for a given $\epsilon$-differential privacy mechanism K.

Corollary 1. Let X, Y be defined as in Theorem 2. As
the query size x for Q1 increases, $E[Y/X]$ gets arbitrarily close
to $y/x$ and $\mathrm{var}[Y/X]$ gets arbitrarily close to zero.

Corollary 1 suggests that Y/X, where Y and X are the
noisy query answers returned by the differential privacy mechanism,
can be a good estimate of the inference probability
y/x for a large query answer x. For example, for $\epsilon = 0.1$ (so b = 10)
and x = 100, $2b^2/x^2 = 0.02$, and following Theorem 2, $E[Y/X]$ is 1.02
times $y/x$; if x = 1000, $E[Y/X]$ is 1.0002 times $y/x$. If y/x is high,
inferential privacy is violated. Note that $\mathrm{var}[Y/X]$ is small in
these cases.
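
The figures in this example can be checked numerically. The sketch below is an illustration only: it evaluates the factor $1 + 2b^2/x^2$ and the variance expression obtained from the Taylor approximation with $\mathrm{var}[X] = \mathrm{var}[Y] = 2b^2$ and zero covariance, and compares them with a small Monte Carlo simulation. The function names, the seed, and the sampling method (Laplace noise drawn as the difference of two exponentials) are assumptions of this sketch, not part of the mechanism's definition.

```python
import random
import statistics

def laplace(b, rng):
    """Draw Laplace(0, b) noise as the difference of two exponentials with mean b."""
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)

def predicted_mean_factor(x, eps):
    """Factor 1 + 2 b^2 / x^2 multiplying y/x in E[Y/X], with b = 1/eps."""
    b = 1.0 / eps
    return 1.0 + 2.0 * b * b / (x * x)

def predicted_variance(x, y, eps):
    """Approximate var[Y/X] = (2 b^2 / x^2) * (1 + (y/x)^2), with b = 1/eps."""
    b = 1.0 / eps
    return (2.0 * b * b / (x * x)) * (1.0 + (y / x) ** 2)

def simulate_ratio(x, y, eps, trials, seed=0):
    """Sample mean and variance of Y/X when both answers get Laplace(1/eps) noise."""
    rng = random.Random(seed)
    b = 1.0 / eps
    ratios = [(y + laplace(b, rng)) / (x + laplace(b, rng)) for _ in range(trials)]
    return statistics.mean(ratios), statistics.variance(ratios)

factor_100 = predicted_mean_factor(100, 0.1)    # about 1.02
factor_1000 = predicted_mean_factor(1000, 0.1)  # about 1.0002
sim_mean, sim_var = simulate_ratio(1000, 990, 0.1, trials=20000, seed=1)
```

The simulation uses x = 1000 so that the noisy denominator stays far from zero; for small x the ratio can be dominated by rare near-zero denominators and the Taylor approximation is no longer a reliable guide.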

2.2 On The Impossibility Results 

A key motivation behind differential privacy is the impossi- 
bility result about the Dalenius's privacy goal [H]. Intuitively, 
it says that for any privacy mechanism and any distribution 
satisfying certain conditions, there is always some particular 
piece of auxiliary information, z, so that z alone is useless to 
an adversary who tries to win, while z in combination with 
access to the data through the privacy mechanism permits 
the adversary to win with probability arbitrarily close to 1. 
The proof assumes an adversary simulated by a Turing ma- 
chine. We argue that a human adversary, who also considers 
the "semantics" when evaluating the usefulness of informa- 
tion, may behave differently. Let us explain this point by the 
Terry Gross example that was originally used to capture the 
intuition of the impossibility result in [6]. 

In the Terry Gross example, the exact height is considered 
private, thus, useful to an adversary, whereas the auxiliary 
information of being two inches shorter than an unknown av- 
erage is considered not private, thus, not useful. Under this 
assumption, accessing the statistical database, which returns 
the average height, is to blame for disclosing Terry Gross's 
privacy. Mathematically, knowing the exact height is remarkable
progress from knowing two inches shorter than an
unknown average. However, to a human adversary, the information
about how an individual deviates from the statistics
already discloses the sensitive information, regardless of what
the statistics are. For example, once knowing that someone
took the HIV check-up ten times more frequently than an 
unknown average, his/her privacy is already leaked. Here, a 
human adversary is able to interpret "deviation" as a sensitive 
notion based on "life experiences", even though mathemati- 
cally deviation does not derive the exact height. It is unclear 
whether such a human adversary can be simulated by a Tur- 
ing machine. 

In practice, a realistic privacy definition does allow dis- 
closure of sensitive information in a controlled manner and 
there are scenarios where it is possible to protect inferential 
privacy while retaining a reasonable level of data utility. For 
example, the study in [10] shows that the anonymized data
is useful for training a classifier because the training does not 
depend on detailed personal information. Another scenario 
is when the utility metric is different from the adversary's 
target. Suppose that the attribute Disease is sensitive and 
the response attribute R (to a medicine) is not. Learning the 
following rules does not violate privacy

(Disease = x1) → (R = Positive)
(Disease = x2) → (R = Positive)



in that a positive response does not indicate a specific dis- 
ease with certainty. However, these rules are useful for a 
researcher to exclude the diseases x1 and x2 in the absence
of a positive response. Even for a sensitive attribute like 
Disease, the varied sensitivity of domain values (such as Flu 
and HIV) could be leveraged to retain more utility for less 
sensitive values while ensuring strong protection for highly 
sensitive items. In the rest of the paper, we present an ap- 
proach of leveraging such varied sensitivity to address some 
drawbacks in previous approaches to inferential privacy. 

3. PROBLEM STATEMENT 

This section defines the problem we will study. First, we 
present our model of adversaries, privacy, and data utility. 

3.1 Preliminaries 

The database is a microdata table T(QI, SA) with each
record corresponding to a participant. QI is a set of non-sensitive
attributes $\{A_1, \dots, A_d\}$. SA is a sensitive attribute
and has the domain $\{x_1, \dots, x_m\}$. m is the domain size of
SA, also written |SA|. Each $x_i$ is called a sensitive value or
an SA value. $o_i$ denotes the number of records for $x_i$ in T and
$f_i$ denotes the frequency $o_i/|T|$, where |T| is the cardinality
of T. For a record r in T, r[QI] and r[SA] denote the values
of r on QI and SA. Table 3 lists some of the notations used
in this paper.

An adversary wants to infer the SA value of a target individual
t. The adversary has access to a published version of
T, denoted by T*. For each SA value $x_i$, $\Pr[x_i \mid t, T^*]$ denotes
the probability that t is inferred to have $x_i$. For now, we consider
an adversary with the following auxiliary information:
t's record is contained in T, t's values on QI, i.e., t[QI],
and the algorithm used to produce T*. Additional auxiliary
information will be considered in Section 6.

One approach for limiting $\Pr[x_i \mid t, T^*]$ is bucketization.
In this approach, the records in T are grouped into small-size
buckets and each bucket is identified by a unique bucket ID,
BID. We use g to refer to both a bucket and the bucket
ID of a bucket, depending on the context. T* is published
in two tables, QIT(QI, BID) and ST(BID, SA). For each
record r in T that is grouped into a bucket g, QIT contains
a record (r[QI], g) and ST contains a record (g, r[SA])
(with duplicates preserved). For a target individual t with
t[QI] contained in a bucket g, the probability of inferring an
SA value $x_i$ using g, $\Pr[x_i \mid t, g]$, is equal to $|g, x_i| / |g|$, where
$|g, x_i|$ denotes the number of occurrences of $(g, x_i)$ in ST and
$|g|$ denotes the size of g. $\Pr[x_i \mid t, T^*]$ is defined to be the
maximum $\Pr[x_i \mid t, g]$ for any bucket g containing t[QI] [22].

(a) QIT
Gender   Zipcode   BID
M        54321     1
M        54322     1
F        61234     2
F        61434     2

(b) ST
BID   Disease
1     Brain Tumor
1     Indigestion
2     Cancer
2     HIV

Table 2: An anonymized table T*

Symbol           Meaning
T, |T|           the raw data and its cardinality
m                domain size of SA
x_i              a sensitive value
o_i              number of occurrences of x_i in T
f_i              frequency of x_i in T, i.e., o_i/|T|
f'_i             privacy threshold for x_i
F'-privacy       a collection of f'_i for x_i
B_j(S_j, b_j)    b_j buckets of size S_j
s(B_j)           total size of buckets in B_j

Table 3: Notations

Example 2. For the microdata T in Table 1, Gender and
Zipcode are the QI attributes and Disease is SA. Table 2
shows the QIT and ST for one bucketization. To infer the
SA value of Alice with QI = (F, 61434), the adversary first
locates the bucket that contains (F, 61434), i.e., BID = 2.
There are two diseases in this bucket, Cancer and HIV, each
occurring once. So $\Pr[x_i \mid Alice, 2] = 50\%$, where $x_i$ is either
Cancer or HIV.

3.2 Privacy Specification

We consider the following privacy specification.

Definition 2 (F'-Privacy). For each SA value $x_i$, $f'_i$-privacy
specifies the requirement that $\Pr[x_i \mid t, T^*] \le f'_i$, where
$f'_i$ is a real number in the range (0,1]. F'-privacy is a collection
of $f'_i$-privacy for all SA values $x_i$.

For example, the publisher may set $f'_i = 1$ for some $x_i$'s
that are not sensitive at all, set $f'_i$ manually to a small value
for a few highly sensitive values $x_i$, and set $f'_i = \min\{1, a \times f_i + b\}$
for the rest of SA values whose sensitivity grows linearly
with their frequency, where a and b are constants. Our
approach assumes that $f'_i$ is specified but does not depend on
how $f'_i$ is specified. The next lemma follows easily and the
proof is omitted.

Lemma 1. A bucketization T* satisfying F'-privacy exists
if and only if $f'_i \ge f_i$ for all $x_i$.
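
The inference in Example 2 and the check required by F'-privacy can be sketched as follows. This is a minimal illustration, assuming the bucketization described in Example 2 (Brain Tumor and Indigestion in bucket 1, Cancer and HIV in bucket 2); the function names and the dictionary-based threshold format are choices made for this sketch.

```python
from collections import Counter, defaultdict

# QIT holds (QI values, BID); ST holds (BID, SA value).
QIT = [(("M", "54321"), 1), (("M", "54322"), 1),
       (("F", "61234"), 2), (("F", "61434"), 2)]
ST = [(1, "Brain Tumor"), (1, "Indigestion"), (2, "Cancer"), (2, "HIV")]

def bucket_counts(st):
    """Map each bucket ID to a Counter of the SA values it contains."""
    per_bucket = defaultdict(Counter)
    for bid, sa in st:
        per_bucket[bid][sa] += 1
    return per_bucket

def inference_probability(qit, st, target_qi, value):
    """Maximum of |g, value| / |g| over the buckets g that contain target_qi."""
    per_bucket = bucket_counts(st)
    buckets = {bid for qi, bid in qit if qi == target_qi}
    return max(
        (per_bucket[bid][value] / sum(per_bucket[bid].values()) for bid in buckets),
        default=0.0,
    )

def satisfies_privacy(st, thresholds):
    """True if every value's share in every bucket is at most its threshold.

    Values missing from `thresholds` are treated as not sensitive (threshold 1).
    """
    return all(
        n / sum(counts.values()) <= thresholds.get(sa, 1.0)
        for counts in bucket_counts(st).values()
        for sa, n in counts.items()
    )

p_hiv = inference_probability(QIT, ST, ("F", "61434"), "HIV")
```

With this bucketization `p_hiv` is 0.5, so a threshold of 0.5 for HIV is met while a threshold of 0.4 is not.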



Remark 1. To model a given F'-privacy specification by
$\ell$-diversity, the smallest $\ell$ required is $\ell = \lceil 1/\min_i f'_i \rceil$.
If some $x_i$ is highly sensitive, i.e., has a small $f'_i$, this $\ell$ will
be too large for less sensitive $x_i$'s. This leads to poor utility
for two reasons. First, the previous bucketization produces
buckets of the sizes $\ell$ or $\ell + 1$. Thus, a large $\ell$ leads
to large buckets and a large information loss. Second, a large
$\ell$ implies that the eligibility requirement [20] for having an
$\ell$-diversity T*, i.e., $1/\ell \ge \max_i f_i$, is more difficult to satisfy.
In contrast, the corresponding eligibility requirement for having
F'-privacy T* is $f'_i \ge f_i$ for all $x_i$'s (Lemma 1), which
is much easier to satisfy. In Section 3.4, we will address the
large bucket size issue by allowing buckets of different sizes
to be formed to accommodate different requirements $f'_i$.
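
The contrast drawn in Remark 1 can be made concrete with a short sketch. The two-value domain below is hypothetical: it keeps only the minimum and maximum frequencies quoted earlier for the Occupation attribute (0.18% and 7.5%), and the thresholds are invented for illustration. Exact fractions are used to avoid floating-point rounding in the ceiling.

```python
import math
from fractions import Fraction

def smallest_l(thresholds):
    """Smallest l with 1/l <= min f'_i, i.e., l = ceil(1 / min f'_i)."""
    return math.ceil(1 / min(thresholds.values()))

def l_diversity_eligible(l, freqs):
    """Eligibility requirement for l-diversity: 1/l >= max f_i."""
    return Fraction(1, l) >= max(freqs.values())

def f_privacy_eligible(freqs, thresholds):
    """Eligibility for F'-privacy (Lemma 1): f'_i >= f_i for every value."""
    return all(thresholds[v] >= f for v, f in freqs.items())

# Hypothetical values: only the extreme frequencies from the text are modelled.
freqs = {"frequent": Fraction(75, 1000), "rare": Fraction(18, 10000)}
thresholds = {"frequent": Fraction(1, 2), "rare": Fraction(1, 50)}

l_needed = smallest_l(thresholds)                      # 50
can_use_l = l_diversity_eligible(l_needed, freqs)      # False: 1/50 < 7.5%
can_use_f = f_privacy_eligible(freqs, thresholds)      # True
```

Under these assumed thresholds, protecting the rare value at 2% would require 50-diversity, which the frequent value makes ineligible, whereas the per-value thresholds are all eligible.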

3.3 Utility Metrics 

Within each bucket g, the QI value of every record is 
equally likely associated with the SA value of every record 
through the common BID. Therefore, the bucket size \g\ 
serves as a measure of the "disorder" of such association. 
This observation motivates the following notion of informa- 
tion loss. 

Definition 3. Let T* consist of a set of buckets $\{g_1, \dots, g_b\}$.
The Mean Squared Error (MSE) of T* is defined by

$$MSE(T^*) = \frac{\sum_{i=1}^{b} (|g_i| - 1)^2}{|T| - 1} \qquad (1)$$

Any bucketization T* has an MSE in the range [0, |T| − 1].
The raw data T is one extreme where each record itself is
a bucket, so MSE = 0. The single bucket containing all
records is the other extreme where MSE = |T| − 1. With
|T| being fixed, to minimize MSE, we shall minimize the
following loss metric:

$$Loss(T^*) = \sum_{i=1}^{b} (|g_i| - 1)^2 \qquad (2)$$

Note that Loss has the additivity property: if $T^* = T_1^* \cup T_2^*$,
then $Loss(T^*) = Loss(T_1^*) + Loss(T_2^*)$.
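
A small sketch of the loss metric and its additivity follows. The `mse` helper normalizes the loss by |T| − 1; this normalization is an assumption of the sketch, chosen because it reproduces the two extremes stated above (0 for singleton buckets, |T| − 1 for a single bucket).

```python
def loss(bucket_sizes):
    """Loss(T*): sum over buckets of (|g| - 1)^2."""
    return sum((size - 1) ** 2 for size in bucket_sizes)

def mse(bucket_sizes):
    """Loss normalized by |T| - 1 (assumed normalization; see the lead-in)."""
    total = sum(bucket_sizes)
    return loss(bucket_sizes) / (total - 1) if total > 1 else 0.0

singletons = [1, 1, 1, 1]   # raw data: every record is its own bucket
one_bucket = [4]            # all records in a single bucket
mixed = [2, 2]

# Additivity: the loss of a union of disjoint bucket collections is the sum of losses.
additive = loss([2, 3]) + loss([4]) == loss([2, 3, 4])
```

Because the loss is quadratic in bucket size, two buckets of size 2 (loss 2) are far cheaper than one bucket of size 4 (loss 9), which is the motivation for keeping buckets as small as the privacy thresholds allow.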

3.4 Problem Description 

To minimize Loss, we consider a general form of bucketization
in which buckets of different sizes can be formed so
that a large bucket size is used for records having a more
sensitive $x_i$ (i.e., a small $f'_i$) and a small bucket size is used
for records having a less sensitive $x_i$ (i.e., a larger $f'_i$). A collection
of buckets can be specified by a bucket setting of the
form $(B_1(S_1, b_1), \dots, B_q(S_q, b_q))$, where $b_j$ is the number of
buckets of the size $S_j$, $j = 1, \dots, q$, and $S_1 < \dots < S_q$. We
also denote a bucket setting simply by $\cup B_j$. $s(B_j) = b_j S_j$
denotes the total size of the buckets in $B_j$. Following Equation (2),
the collection of buckets specified by $\cup B_j$ has the loss
$\sum_{j=1}^{q} b_j \times (S_j - 1)^2$. We denote this loss by $Loss(\cup B_j)$.

A bucket setting $\cup B_j$ is feasible wrt T if $\sum_j s(B_j) = |T|$.
A feasible bucket setting is valid wrt F'-privacy if there is an
assignment of the records in T to the buckets in $\cup B_j$ such
that no SA value $x_i$ has a frequency more than $f'_i$ in any
bucket g, i.e., $\Pr[x_i \mid t, g] \le f'_i$. Such an assignment is called a
valid record assignment.



Definition 4 (Optimal multi-size bucket setting).
Given T and F'-privacy, we want to find a valid bucket setting
$(B_1(S_1, b_1), \dots, B_q(S_q, b_q))$ that has the minimum $Loss(\cup B_j)$
among all valid bucket settings.

This problem must determine the number q of distinct
bucket sizes, each bucket size $S_j$ and the number $b_j$ of buckets
for the size $S_j$, $1 \le j \le q$. The following special case is a
building block of our solution.

Definition 5 (Optimal two-size bucket setting).
Given T and F'-privacy, we want to find a valid two-size
bucket setting $(B_1(S_1, b_1), B_2(S_2, b_2))$ that has the minimum
loss among all valid two-size bucket settings.

Remark 2. The bucket setting problem is challenging for
several reasons. Firstly, allowing varied sensitivity $f'_i$ and
buckets of different sizes $S_j$ introduces the new challenge of
finding the best bucket setting that can fulfil the requirement
$f'_i$ for all $x_i$'s. Even for a given bucket setting, it is non-trivial
to validate whether there is a valid record assignment
to the buckets. Secondly, the number of feasible bucket settings
of the form $((S_1, b_1), \dots, (S_q, b_q))$ is huge, rendering it
prohibitive to enumerate all bucket settings. For example,
suppose that $S_1$ and $S_2$ are chosen from the range of [3,20],
and |T| = 1,000,000; there are a total of 2,077,869 feasible
bucket settings of the form $(S_1, b_1)$ and $(S_2, b_2)$. This number
will be much larger if q > 2. Finally, the number of distinct
bucket sizes q is unknown in advance and must be searched.
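
To give a feel for the search space, the sketch below computes the loss of a bucket setting and enumerates feasible two-size settings by brute force. It is an illustration under stated assumptions: settings are written as lists of (S_j, b_j) pairs, and both sizes are required to be used at least once (b_1, b_2 ≥ 1), which may differ from the counting convention behind the figure quoted above.

```python
def setting_loss(setting):
    """Loss of a bucket setting [(S_1, b_1), ..., (S_q, b_q)]: sum of b_j * (S_j - 1)^2."""
    return sum(b * (s - 1) ** 2 for s, b in setting)

def is_feasible(setting, n):
    """A setting is feasible for a table of n records if bucket capacities sum to n."""
    return sum(s * b for s, b in setting) == n

def feasible_two_size_settings(n, min_size, max_size):
    """Yield ((S1, b1), (S2, b2)) with S1 < S2, b1, b2 >= 1 and S1*b1 + S2*b2 == n."""
    for s1 in range(min_size, max_size + 1):
        for s2 in range(s1 + 1, max_size + 1):
            # Leave room for at least one bucket of size s1.
            for b2 in range(1, (n - s1) // s2 + 1):
                rest = n - s2 * b2
                if rest % s1 == 0:
                    yield (s1, rest // s1), (s2, b2)

small = list(feasible_two_size_settings(10, 3, 4))   # [((3, 2), (4, 1))]
```

Even this naive enumeration shows why exhaustive search is unattractive: the number of candidate settings grows with |T| for every pair of sizes, and each candidate still has to be validated against F'-privacy.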

Section 4 presents an algorithm for validating a two-size 
bucket setting. Section 5 presents an efficient algorithm for 
the optimal two-size bucket setting problem with guaranteed 
optimality, and a heuristic algorithm for the multi-size bucket 
setting problem. 

4. VALIDATING TWO-SIZE BUCKET SETTING

Let Valid(B, T, F') denote a function that tests if a bucket
setting B is valid. We assume that the number of occurrences
$o_i$ for $x_i$ in T has been collected, $1 \le i \le m$. In Section 4.1, we
consider buckets having the same size and we give an O(m)
time and space algorithm for evaluating Valid(B, T, F'). In
Section 4.2, we consider buckets having two different sizes and
give an O(m) time and space algorithm for Valid(B, T, F').
In both cases, we give a linear time algorithm for finding a
valid record assignment for a valid bucket setting.

4.1 One-Size Bucket Setting 

Let $B = \{g_0, \dots, g_{b-1}\}$ be a set of b buckets of the same
size S. To validate this bucket setting, we introduce a round-robin
assignment of records to buckets.

Round-Robin Assignment (RRA): For each value $x_i$,
$1 \le i \le m$, we assign the t-th record of $x_i$ to the bucket
$g_s$, where $s = (o_1 + \dots + o_{i-1} + t) \bmod b$, where $o_i$ is the
number of occurrences of $x_i$ in T. In other words, the records
for $x_i$ are assigned to the buckets in a round-robin manner;
the order in which $x_i$ is considered by RRA is not important.
It is easy to see that the number of records for $x_i$ assigned to
a bucket is either $\lfloor o_i/b \rfloor$ or $\lceil o_i/b \rceil$. The next lemma gives a
sufficient and necessary condition for Valid(B, T, F') = true.



Lemma 2 (Validating one-size bucket setting). Let
B be a set of b buckets of size S such that |T| = s(B). The following
are equivalent: (1) Valid(B, T, F') = true. (2) There
is a valid RRA from T to B wrt F'. (3) For each SA value
$x_i$, $\lceil o_i/b \rceil / S \le f'_i$. (4) For each SA value $x_i$, $o_i \le \lfloor f'_i S \rfloor b$.

Proof. We show 4 ⇔ 3 ⇒ 2 ⇒ 1 ⇒ 4. Observe that if
r is a real number and i is an integer, $r \le i$ if and only if
$\lceil r \rceil \le i$, and $i \le r$ if and only if $i \le \lfloor r \rfloor$. Then the following
rewriting holds.

$$\frac{\lceil o_i/b \rceil}{S} \le f'_i \Leftrightarrow \lceil o_i/b \rceil \le f'_i S \Leftrightarrow \lceil o_i/b \rceil \le \lfloor f'_i S \rfloor \Leftrightarrow o_i/b \le \lfloor f'_i S \rfloor \Leftrightarrow o_i \le \lfloor f'_i S \rfloor b.$$

This shows the equivalence of 4 and 3.

To see 3 ⇒ 2, observe that $\lceil o_i/b \rceil / S$ is the maximum frequency
of $x_i$ in a bucket generated by RRA. Condition 3
implies that this assignment is valid. 2 ⇒ 1 follows because
every valid RRA is a valid assignment. To see 1 ⇒ 4, observe
that F'-privacy implies that the number of occurrences of $x_i$
in a bucket of size S is at most $\lfloor f'_i S \rfloor$. Thus for any valid
assignment, the total number of occurrences $o_i$ in the b buckets
of size S is no more than $\lfloor f'_i S \rfloor b$. □
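
Condition (4) of Lemma 2 and the round-robin assignment can be sketched as follows. This is a minimal illustration, not the paper's implementation: records are represented only by per-value counts, the function names are hypothetical, and the round-robin position starts at 0 rather than 1, which only rotates the bucket labels. Thresholds are given as exact fractions so that the floor is not affected by floating-point rounding.

```python
import math
from collections import Counter
from fractions import Fraction

def valid_one_size(occurrences, thresholds, size, num_buckets):
    """Lemma 2(4): o_i <= floor(f'_i * S) * b for every value, given |T| == S * b."""
    if sum(occurrences.values()) != size * num_buckets:
        return False
    return all(
        o <= math.floor(thresholds[v] * size) * num_buckets
        for v, o in occurrences.items()
    )

def round_robin_assign(occurrences, num_buckets):
    """Deal the records of each value to buckets g_0 .. g_{b-1} in turn."""
    buckets = [Counter() for _ in range(num_buckets)]
    position = 0
    for value, o in occurrences.items():
        for _ in range(o):
            buckets[position % num_buckets][value] += 1
            position += 1
    return buckets

# Hypothetical data: 8 records, two buckets of size 4.
occurrences = {"HIV": 2, "Flu": 4, "Cold": 2}
thresholds = {"HIV": Fraction(1, 4), "Flu": Fraction(1, 2), "Cold": Fraction(1, 2)}

ok = valid_one_size(occurrences, thresholds, size=4, num_buckets=2)
buckets = round_robin_assign(occurrences, num_buckets=2)
```

In this example each bucket receives one HIV, two Flu, and one Cold record, so every value meets its threshold exactly as the lemma predicts; tightening the HIV threshold to 1/5 makes condition (4) fail.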

4.2 Two-Size Bucket Setting 

Now we consider a two-size bucket setting of the form
$(B_1(S_1, b_1), B_2(S_2, b_2))$. The next lemma follows trivially.

Lemma 3. $Valid(B_1 \cup B_2, T, F')$ = true if and only if there
is a partition of T, $\{T_1, T_2\}$, such that $Valid(B_1, T_1, F')$ =
true and $Valid(B_2, T_2, F')$ = true.

Definition 6. Given F'-privacy, for each $x_i$ and for j =
1, 2, we define $u_{ij} = \lfloor f'_i S_j \rfloor b_j$ and $a_{ij} = \min\{u_{ij}, o_i\}$.

From Lemma 2(4), $u_{ij}$ is the upper bound on the number of
records for $x_i$ that can be allocated to $B_j$ without violating
$f'_i$-privacy, assuming unlimited supply of $x_i$ records. $a_{ij}$ is
the upper bound, assuming the actual supply of $x_i$ records,
i.e., $o_i$. The next theorem gives the condition for $Valid(B_1 \cup
B_2, T, F')$ = true.

Theorem 3 (Validating two-size bucket setting).
$Valid(B_1 \cup B_2, T, F')$ = true if and only if all of the following
conditions hold:

$$\forall i: a_{i1} + a_{i2} \ge o_i \quad \text{(Privacy Constraint (PC))} \qquad (3)$$

$$j = 1, 2: \sum_i a_{ij} \ge s(B_j) \quad \text{(Fill Constraint (FC))} \qquad (4)$$

$$|T| = s(B_1) + s(B_2) \quad \text{(Capacity Constraint (CC))} \qquad (5)$$

Proof. Intuitively, Equation (3) says that the number of
occurrences of $x_i$ does not exceed the upper bound $a_{i1} + a_{i2}$
imposed by F'-privacy on all buckets collectively, thus, the name
Privacy Constraint. Equation (4) says that under this upper
bound constraint it is possible to fill up the buckets in $B_j$
without leaving unused slots, thus, the name Fill Constraint.
Equation (5) says that the total bucket capacity matches the
data cardinality, thus the name Capacity Constraint. Clearly,
all these conditions are necessary for a valid assignment. The
sufficiency proof is given by the algorithm in the next subsection
that finds a valid assignment of the records in T to the
buckets in $B_1$ and $B_2$, assuming that the above conditions
hold. □

In the rest of the paper, PC, FC, and CC denote Privacy
Constraint, Fill Constraint, and Capacity Constraint in Theorem 3.



Corollary 2. For a set of buckets $B$ with at most two bucket
sizes, $Valid(B, T, F') = true$ can be tested in $O(m)$ time and
$O(m)$ space.
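
The three conditions of Theorem 3 can be checked in one pass over the SA values. The following is a minimal sketch, not the authors' implementation: the function names and the list-based inputs (`f_prime` for the thresholds, `counts` for the occurrence counts) are assumptions made for illustration, and the capacity of a bucket set is taken to be the bucket size times the number of buckets.

```python
from math import floor


def upper_bounds(f_prime, counts, size, num_buckets):
    """a_ij = min(floor(f'_i * S_j) * b_j, o_i) for every SA value x_i."""
    return [min(floor(fp * size) * num_buckets, o)
            for fp, o in zip(f_prime, counts)]


def valid_two_size(f_prime, counts, s1, b1, s2, b2):
    """Check PC, FC and CC for b1 buckets of size s1 and b2 buckets of size s2."""
    a1 = upper_bounds(f_prime, counts, s1, b1)
    a2 = upper_bounds(f_prime, counts, s2, b2)
    # PC: every value fits under the combined upper bound of both bucket sets.
    pc = all(x + y >= o for x, y, o in zip(a1, a2, counts))
    # FC: each bucket set can be filled completely.
    fc = sum(a1) >= s1 * b1 and sum(a2) >= s2 * b2
    # CC: total capacity equals the number of records.
    cc = sum(counts) == s1 * b1 + s2 * b2
    return pc and fc and cc
```

The check uses floating-point thresholds for brevity; exact rationals (for example `fractions.Fraction`) would avoid rounding effects when a product such as $f_i' S_j$ is an integer.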

4.3 Record Partitioning 

Suppose that PC, FC and CC in Theorem 3 hold. We show
how to find a partition $\{T_1, T_2\}$ of $T$ such that
$Valid(B_1, T_1, F') = true$ and $Valid(B_2, T_2, F') = true$. This
provides the sufficiency proof for Theorem 3 because Lemma 3 implies
$Valid(B_1 \cup B_2, T, F') = true$. By finding the partition
$\{T_1, T_2\}$, we also provide an algorithm for assigning records
from $T$ to the buckets in $B_1 \cup B_2$, that is, simply applying
RRA to each of $(T_j, B_j)$, $j = 1, 2$.

The partition $\{T_1, T_2\}$ can be created as follows. For each
SA value $x_i$, initially $T_1$ contains any $a_{i1}$ records and $T_2$
contains the remaining $o_i - a_{i1}$ records for $x_i$. Since
$a_{i1} \leq u_{i1}$, Lemma 2(4) holds on $(T_1, B_1)$. (Note that in
this case, $o_i$ in Lemma 2 is the number of occurrences of $x_i$ in
$T_1$.) PC implies that the number of occurrences of $x_i$ in $T_2$,
i.e., $o_i - a_{i1}$, is no more than $a_{i2}$; therefore, Lemma 2(4)
also holds on $(T_2, B_2)$. FC implies $|T_1| \geq s(B_1)$. If
$|T_1| = s(B_1)$, then $|T_2| = s(B_2)$ (by CC), and from the above
discussion and Lemma 2, $Valid(B_1, T_1, F') = true$ and
$Valid(B_2, T_2, F') = true$. We are done.

We now assume $|T_1| > s(B_1)$, thus $|T_2| < s(B_2)$. We need
to move $|T_1| - s(B_1)$ records from $T_1$ to $T_2$ without exceeding
the upper bound $a_{i2}$ for $T_2$. FC implies that such moves are
possible because there must be some $x_i$ for which fewer than
$a_{i2}$ records are found in $T_2$. For such $x_i$, we move records
of $x_i$ from $T_1$ to $T_2$ until the number of records for $x_i$ in
$T_2$ reaches $a_{i2}$ or until $|T_2| = s(B_2)$, whichever comes first.
Since we move a record for $x_i$ to $T_2$ only when there are fewer
than $a_{i2}$ records for $x_i$ in $T_2$, the condition of Lemma 2(4)
is preserved on $(T_2, B_2)$. Clearly, moving a record out of $T_1$
always preserves the condition of Lemma 2(4) on $(T_1, B_1)$.
As long as $|T_2| < s(B_2)$, the above argument can be repeated
to move more records from $T_1$ to $T_2$.

Eventually, we have $|T_2| = s(B_2)$, so $Valid(B_1, T_1, F') = true$
and $Valid(B_2, T_2, F') = true$. Then $\{T_1, T_2\}$ is the
partition required.
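
The partitioning procedure above can be sketched on per-value record counts rather than on individual records. This is an illustrative sketch under stated assumptions: the function name `partition_counts` and its list-based inputs are hypothetical, and surplus records are moved greedily in index order, which is one of several valid choices.

```python
from math import floor


def partition_counts(f_prime, counts, s1, b1, s2, b2):
    """Return (t1, t2): how many records of each SA value go to T1 and T2.

    Assumes PC, FC and CC of Theorem 3 hold for the given bucket setting.
    """
    a1 = [min(floor(fp * s1) * b1, o) for fp, o in zip(f_prime, counts)]
    a2 = [min(floor(fp * s2) * b2, o) for fp, o in zip(f_prime, counts)]
    t1 = list(a1)                                  # initially a_i1 records each
    t2 = [o - a for o, a in zip(counts, a1)]       # the remaining records
    surplus = sum(t1) - s1 * b1                    # records to move from T1 to T2
    for i in range(len(counts)):
        if surplus <= 0:
            break
        # Move records of x_i while T2 stays within its upper bound a_i2.
        move = max(0, min(surplus, a2[i] - t2[i], t1[i]))
        t1[i] -= move
        t2[i] += move
        surplus -= move
    if surplus != 0:
        raise ValueError("PC, FC or CC does not hold for this bucket setting")
    return t1, t2
```

Each of the two resulting count vectors can then be handed to RRA for its own bucket set, as described above.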
Figure 1: The record assignment for Example 3

Example 3. Suppose $f_i' = 2 \times f_i + 0.05$. Consider a table
$T$ containing 50 records with $o_i$ for $x_i$ as follows:

$x_1$-$x_8$: $o_i = 1$, $f_i = 0.02$ and $f_i' = 0.09$.
$x_9$-$x_{12}$: $o_i = 6$, $f_i = 0.12$ and $f_i' = 0.29$.
$x_{13}$-$x_{14}$: $o_i = 9$, $f_i = 0.18$ and $f_i' = 0.41$.

Consider the bucket setting $B_1(S_1 = 4, b_1 = 9)$,
$B_2(S_2 = 14, b_2 = 1)$. Note that CC in Theorem 3 holds. Let us
compute $a_{i1}$ and $a_{i2}$.

$a_{i1} = \min\{u_{i1}, o_i\}$: For $x_1$-$x_8$,
$u_{i1} = \lfloor f_i' S_1 \rfloor b_1 = \lfloor 0.09 \times 4 \rfloor \times 9 = 0$,
so $a_{i1} = 0$. For $x_9$-$x_{12}$,
$u_{i1} = \lfloor 0.29 \times 4 \rfloor \times 9 = 9$, $a_{i1} = 6$.
For $x_{13}$-$x_{14}$,
$u_{i1} = \lfloor 0.41 \times 4 \rfloor \times 9 = 9$, $a_{i1} = 9$.

$a_{i2} = \min\{u_{i2}, o_i\}$: For $x_1$-$x_8$,
$u_{i2} = \lfloor f_i' S_2 \rfloor b_2 = \lfloor 0.09 \times 14 \rfloor \times 1 = 1$,
$a_{i2} = 1$. For $x_9$-$x_{12}$,
$u_{i2} = \lfloor 0.29 \times 14 \rfloor \times 1 = 4$, $a_{i2} = 4$.
For $x_{13}$-$x_{14}$,
$u_{i2} = \lfloor 0.41 \times 14 \rfloor \times 1 = 5$, $a_{i2} = 5$.

It can be verified that PC and FC in Theorem 3 hold. To
find the partitioning $\{T_1, T_2\}$, initially $T_1$ contains
$a_{i1} = 0$ records for each of $x_1$-$x_8$, $a_{i1} = 6$ records for
each of $x_9$-$x_{12}$, and $a_{i1} = 9$ records for each of
$x_{13}$-$x_{14}$. $T_2$ contains the remaining records in $T$. Since
$T_1$ contains 42 records, but $s(B_1) = 36$, we need to move 6
records from $T_1$ to $T_2$ without exceeding the upper bound
$a_{i2}$ for $T_2$. This can be done by moving one record for each of
$x_9$-$x_{14}$ from $T_1$ to $T_2$. Figure 1 shows a record assignment
generated by RRA for $(B_1, T_1)$ and $(B_2, T_2)$.

5. FINDING OPTIMAL BUCKET SETTINGS 

We now present an efficient algorithm for finding the opti- 
mal bucket setting. Section 5.1 presents an exact solution for 
the two-size bucket setting problem. Section 5.2 presents a 
heuristic solution for the multi-size bucket setting problem. 

5.1 Algorithms for Two-Size Bucket Settings 

Given $T$ and $F'$-privacy, we want to find the valid bucket
setting of the form $(B_1(S_1, b_1), B_2(S_2, b_2))$, where
$b_j \geq 0$ and $S_1 < S_2$, such that the following loss is minimized:

$Loss(B_1 \cup B_2) = b_1(S_1 - 1)^2 + b_2(S_2 - 1)^2$ (6)

One approach is applying Theorem 3 to validate each feasible
bucket setting $(B_1, B_2)$, but this is inefficient because
the number of such bucket settings can be huge (Remark 2).
We present a more efficient algorithm that prunes the bucket
settings that are not valid or do not have the minimum loss.
Observe that $F'$-privacy implies that a record for $x_i$ must be
placed in a bucket of size at least $\lceil 1/f_i' \rceil$; therefore,
the minimum size for $S_1$ and $S_2$ is
$M = \min_i\{\lceil 1/f_i' \rceil\}$. The maximum bucket size $M'$ for
$S_1$ and $S_2$ is constrained by the maximum loss allowed. We assume
that $M'$ is given, where $M' > M$. We consider only $(S_1, S_2)$ such
that $M \leq S_1 < S_2 \leq M'$. Note that a valid bucket setting may
not exist in this range of sizes.

5.1.1 Indexing Bucket Settings

We first present an "indexing" structure for feasible bucket
settings to allow direct access to any feasible bucket setting.
We say that a pair $(b_1, b_2)$ is feasible (resp. valid) wrt
$(S_1, S_2)$ if the bucket setting $(B_1(S_1, b_1), B_2(S_2, b_2))$ is
feasible (resp. valid). A valid pair $(b_1, b_2)$ is optimal wrt
$(S_1, S_2)$ if $Loss(B_1 \cup B_2)$ is minimum among all valid pairs
wrt $(S_1, S_2)$. We define $\Gamma(S_1, S_2)$ to be the list of all
feasible $(b_1, b_2)$ in the descending order of $b_1$, thus, in the
ascending order of $b_2$. Intuitively, an earlier bucket setting has
more smaller buckets, thus a smaller Loss, than a later bucket
setting. Below, we show that the $i$-th pair in $\Gamma(S_1, S_2)$ can
be generated directly using the position $i$ without scanning the
list. We will use this property to locate all valid pairs by a binary
search on $\Gamma(S_1, S_2)$ without storing the list. To this end, it
suffices to identify the first and last pairs in $\Gamma(S_1, S_2)$,
and the increments of $b_1$ and $b_2$ between two consecutive pairs.

The first pair in $\Gamma(S_1, S_2)$, denoted $(b_1^0, b_2^0)$, has the
largest possible $b_1$ such that $S_1 b_1 + S_2 b_2 = |T|$. So
$(b_1^0, b_2^0)$ is the solution to the following integer linear
program:

$\min\{b_2 \mid S_1 b_1 + S_2 b_2 = |T|\}$ (7)

where $b_1$ and $b_2$ are non-negative integer variables and
$S_1, S_2, |T|$ are constants.

Next, consider two consecutive pairs $(b_1, b_2)$ and
$(b_1 - \Delta_1, b_2 + \Delta_2)$ in $\Gamma(S_1, S_2)$. Since
$S_1 b_1 + S_2 b_2 = |T|$ and
$S_1(b_1 - \Delta_1) + S_2(b_2 + \Delta_2) = |T|$, we have
$S_1 \Delta_1 = S_2 \Delta_2$. Since $\Delta_1$ and $\Delta_2$ are the
smallest positive integers such that this equality holds,
$S_2 \Delta_2$ must be the least common multiple of $S_1$ and $S_2$,
denoted by $LCM(S_1, S_2)$. $\Delta_2$ and $\Delta_1$ are then given by

$\Delta_2 = LCM(S_1, S_2)/S_2$, $\Delta_1 = LCM(S_1, S_2)/S_1$ (8)

Therefore, the $i$-th pair in $\Gamma(S_1, S_2)$ has the form
$(b_1^0 - i \Delta_1, b_2^0 + i \Delta_2)$, where $i \geq 0$. The last
pair has the maximum $i$ such that $0 \leq b_1^0 - i \Delta_1 < \Delta_1$,
or $b_1^0/\Delta_1 - 1 < i \leq b_1^0/\Delta_1$.
The only integer $i$ satisfying this condition is given by

$k = \lfloor b_1^0/\Delta_1 \rfloor$ (9)



Lemma 4. $\Gamma(S_1, S_2)$ has the form

$(b_1^0, b_2^0), (b_1^0 - \Delta_1, b_2^0 + \Delta_2), \ldots, (b_1^0 - k \Delta_1, b_2^0 + k \Delta_2)$ (10)

where $b_1^0, b_2^0, \Delta_1, \Delta_2, k$ are defined in Equations (7)-(9).



Remark 3. $\Gamma(S_1, S_2)$ in Lemma 4 has several important
properties for dealing with a large data set. Firstly, we can
access the $i$-th element of $\Gamma(S_1, S_2)$ without storing or
scanning the list. Secondly, we can represent any sublist of
$\Gamma(S_1, S_2)$ by a bounding interval $[i, j]$, where $i$ is the
starting position and $j$ is the ending position of the sublist.
Thirdly, the common sublist of two sublists $L$ and $L'$ of
$\Gamma(S_1, S_2)$, denoted by $L \cap L'$, is given by the
intersection of the bounding intervals of $L$ and $L'$.

Example 4. Let $|T| = 28$, $S_1 = 2$, $S_2 = 4$. $LCM(S_1, S_2) = 4$.
$\Delta_2 = 4/4 = 1$ and $\Delta_1 = 4/2 = 2$. $b_1^0 = 14$, $b_2^0 = 0$.
$k = \lfloor 14/2 \rfloor = 7$. $\Gamma(S_1, S_2)$ is (14,0), (12,1),
(10,2), (8,3), (6,4), (4,5), (2,6), (0,7).
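
The indexing scheme can be sketched as follows. This is an illustrative sketch rather than the authors' code: the names `gamma_index` and `pair_at` are hypothetical, and the first pair is found by trying the few candidate values of $b_2$ below the $b_2$ increment instead of calling an integer programming solver.

```python
from math import gcd


def gamma_index(total, s1, s2):
    """Describe the list of feasible pairs for bucket sizes s1 < s2.

    Returns (b1_0, b2_0, d1, d2, k), or None if no feasible pair exists.
    """
    lcm = s1 * s2 // gcd(s1, s2)
    d1, d2 = lcm // s1, lcm // s2          # increments of b1 and b2
    for b2 in range(d2):                   # the smallest b2 is below d2 if any exists
        rest = total - s2 * b2
        if rest >= 0 and rest % s1 == 0:
            b1 = rest // s1
            return b1, b2, d1, d2, b1 // d1
    return None


def pair_at(index, i):
    """Generate the i-th pair (0-based) directly from its position."""
    b1, b2, d1, d2, k = index
    if not 0 <= i <= k:
        raise IndexError(i)
    return b1 - i * d1, b2 + i * d2
```

With the values of Example 4, `gamma_index(28, 2, 4)` describes a list whose positions 0 through 7 are generated by `pair_at` without the list ever being stored.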

The length $k$ of $\Gamma(S_1, S_2)$, given by Equation (9), is
proportional to the cardinality $|T|$. $b_1^0$ is as large as
$|T|/S_1$ (when $b_2^0 = 0$) and $\Delta_1$ is no more than $S_2$. Thus
$k$ is as large as $|T|/(S_1 S_2)$. With $S_1$ and $S_2$ being small,
$k$ is proportional to $|T|$. Therefore, examining all pairs in
$\Gamma(S_1, S_2)$ is not scalable. In the rest of this section, we
explore two pruning strategies to prune unpromising pairs
$(b_1, b_2)$ in $\Gamma(S_1, S_2)$, one based on loss minimization and
one based on the privacy requirement.

5.1.2 Loss-Based Pruning 

Our first strategy is pruning the pairs in $\Gamma(S_1, S_2)$ that
do not have the minimum loss wrt $(S_1, S_2)$, by exploiting
the following monotonicity of Loss, which follows from the
descending order of $b_1$, $S_1 < S_2$, and Equation (6).

Lemma 5 (Monotonicity of loss). If $(b_1, b_2)$ precedes
$(b_1', b_2')$ in $\Gamma(S_1, S_2)$, then
$Loss(B_1 \cup B_2) < Loss(B_1' \cup B_2')$, where
$B_j$ contains $b_j$ buckets of size $S_j$, and $B_j'$ contains $b_j'$
buckets of size $S_j$, $j = 1, 2$.

Thus the first valid pair in $\Gamma(S_1, S_2)$ is the optimal pair wrt
$(S_1, S_2)$. Lemma 5 can also be exploited to prune pairs across
different $(S_1, S_2)$. Let $Best_{loss}$ be the minimum loss found
so far and $(S_1, S_2)$ be the next pair of sizes to be considered.
From Lemma 5, all the pairs in $\Gamma(S_1, S_2)$ that have a loss less
than $Best_{loss}$ must form a prefix of $\Gamma(S_1, S_2)$. Let
$(b_1^*, b_2^*)$ be the cutoff point of this prefix, where
$b_1^* = b_1^0 - k^* \Delta_1$ and $b_2^* = b_2^0 + k^* \Delta_2$.
$k^*$ is the maximum integer satisfying
$b_1^*(S_1 - 1)^2 + b_2^*(S_2 - 1)^2 < Best_{loss}$. $k^*$ is given by

$k^* = \max\{0, \lfloor \frac{Best_{loss} - b_1^0(S_1 - 1)^2 - b_2^0(S_2 - 1)^2}{\Delta_2(S_2 - 1)^2 - \Delta_1(S_1 - 1)^2} \rfloor\}$ (11)

The next lemma revises $\Gamma(S_1, S_2)$ by the cutoff point based
on $Best_{loss}$.

Lemma 6 (Loss-based pruning). Let $Best_{loss}$ be the minimum
loss found so far and let $(S_1, S_2)$ be the next pair of
sizes to consider. Let $k' = \min\{k, k^*\}$, where $k$ is given by
Equation (9) and $k^*$ is given by Equation (11). Let
$\Gamma'(S_1, S_2)$ denote the prefix of $\Gamma(S_1, S_2)$ that
contains the first $k' + 1$ pairs. It suffices to consider
$\Gamma'(S_1, S_2)$.

In the rest of this section, $\Gamma'$ denotes $\Gamma'(S_1, S_2)$ when
$S_1$ and $S_2$ are clear from context.
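
The cutoff position of the loss-based pruning can be computed in constant time. The sketch below follows the cutoff formula of Equation (11) literally; the function names and argument order are assumptions for illustration. The case of an infinite best loss (the initial state, where nothing can be pruned) is handled separately because the floor of infinity is undefined.

```python
from math import floor, isinf


def loss(s1, b1, s2, b2):
    """Loss of b1 buckets of size s1 plus b2 buckets of size s2 (Equation (6))."""
    return b1 * (s1 - 1) ** 2 + b2 * (s2 - 1) ** 2


def prefix_length(best_loss, s1, s2, b1_0, b2_0, d1, d2, k):
    """Return k' = min(k, k*), the last position kept by loss-based pruning."""
    if isinf(best_loss):
        return k                                       # no bound yet: keep all pairs
    # Loss increase between consecutive pairs; positive whenever s1 < s2.
    step = d2 * (s2 - 1) ** 2 - d1 * (s1 - 1) ** 2
    k_star = max(0, floor((best_loss - loss(s1, b1_0, s2, b2_0)) / step))
    return min(k, k_star)
```

For the list of Example 4 the loss grows by 7 per position starting from 14, so a best loss of 30 keeps positions 0 through 2 (losses 14, 21 and 28).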

5.1.3 Privacy-Based Pruning 

From Lemma 5, the optimal pair wrt $(S_1, S_2)$ is the first
valid pair in $\Gamma'$. Our second strategy is to locate the first
valid pair in $\Gamma'$ directly by exploiting a certain monotonicity
property of the condition for a valid pair. First, we introduce
some terminology. Consider any sublist $L$ of $\Gamma'$ and any
boolean condition $C$ on a pair. $H(C, L)$ denotes the set of
all pairs in $L$ on which $C$ holds, and $F(C, L)$ denotes the set
of all pairs in $L$ on which $C$ fails. $C$ is monotone in $L$ if
whenever $C$ holds on a pair in $L$, it holds on all later pairs in
$L$, and anti-monotone in $L$ if whenever $C$ fails on a pair in
$L$, it fails on all later pairs in $L$. A monotone $C$ splits $L$ into
two sublists $F(C, L)$ and $H(C, L)$ in that order, and an
anti-monotone $C$ splits $L$ into two sublists $H(C, L)$ and $F(C, L)$
in that order. Therefore, if we can show that FC and PC in
Theorem 3 are monotone or anti-monotone, we can locate all
valid pairs in $\Gamma'$, i.e., those satisfying both FC and PC, by a
binary search over $\Gamma'$. We consider FC and PC separately.

Monotonicity of FC. Let $FC(S_1)$ denote FC for $j = 1$,
and $FC(S_2)$ denote FC for $j = 2$. Note that $H(FC, \Gamma')$ is
given by $H(FC(S_1), \Gamma') \cap H(FC(S_2), \Gamma')$.

Lemma 7 (Monotonicity of FC). $FC(S_1)$ is monotone
in $\Gamma'$ and $FC(S_2)$ is anti-monotone in $\Gamma'$.

Proof. We rewrite FC as

$\sum_i \min\{\lfloor f_i' S_1 \rfloor b_1, o_i\} \geq S_1 b_1$ (12)

$\sum_i \min\{\lfloor f_i' S_2 \rfloor b_2, o_i\} \geq S_2 b_2$ (13)

Assume that $(b_1, b_2)$ precedes $(b_1', b_2')$ in $\Gamma'$. Then
$b_1 > b_1'$ and $b_2 < b_2'$. As $b_1$ decreases to $b_1'$, both
$\lfloor f_i' S_1 \rfloor b_1$ and $S_1 b_1$ decrease by a factor of
$b_1'/b_1$, but $o_i$ remains unchanged.
Therefore, if Equation (12) holds for $(b_1, b_2)$, it holds for
$(b_1', b_2')$ as well; so Equation (12) is monotone on $\Gamma'$. For
a similar reason, if Equation (13) fails on $(b_1, b_2)$, it continues
to fail on $(b_1', b_2')$ as well; thus Equation (13) is anti-monotone
on $\Gamma'$. □

Monotonicity of PC. Let $PC(x_i)$ denote PC for $x_i$.
$H(PC, \Gamma')$ is given by $\cap_i H(PC(x_i), \Gamma')$. To compute
$H(PC(x_i), \Gamma')$, we rewrite $PC(x_i)$ as

$\min\{\lfloor f_i' S_1 \rfloor b_1, o_i\} + \min\{\lfloor f_i' S_2 \rfloor b_2, o_i\} \geq o_i$ (14)

Since $b_1$ is decreasing and $b_2$ is increasing in $\Gamma'$,
$\lfloor f_i' S_1 \rfloor b_1 \geq o_i$ is anti-monotone and
$\lfloor f_i' S_2 \rfloor b_2 \geq o_i$ is monotone in $\Gamma'$. Note
that Equation (14) holds in
$H(\lfloor f_i' S_1 \rfloor b_1 \geq o_i, \Gamma')$ and
$H(\lfloor f_i' S_2 \rfloor b_2 \geq o_i, \Gamma')$.

Let us consider the remaining part of $\Gamma'$, denoted by $\Gamma'(x_i)$:

$F(\lfloor f_i' S_1 \rfloor b_1 \geq o_i, \Gamma') \cap F(\lfloor f_i' S_2 \rfloor b_2 \geq o_i, \Gamma')$.

In this part, Equation (14), thus $PC(x_i)$, degenerates into

$\lfloor f_i' S_1 \rfloor b_1 + \lfloor f_i' S_2 \rfloor b_2 \geq o_i$ (15)

Consider

$\lfloor f_i' S_2 \rfloor \Delta_2 \geq \lfloor f_i' S_1 \rfloor \Delta_1$ (16)

and any two consecutive pairs $(b_1, b_2)$ and
$(b_1 - \Delta_1, b_2 + \Delta_2)$ in $\Gamma'(x_i)$. If Equation (16)
holds, Equation (15) holding on $(b_1, b_2)$ implies that it holds on
$(b_1 - \Delta_1, b_2 + \Delta_2)$, thus Equation (15) is monotone; if
Equation (16) fails, Equation (15) failing on $(b_1, b_2)$ implies
that it fails on $(b_1 - \Delta_1, b_2 + \Delta_2)$, thus Equation (15)
is anti-monotone. Recall that in $\Gamma'(x_i)$, $PC(x_i)$ degenerates
into Equation (15). The next lemma summarizes the above discussion.

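Lemma 8 (Monotonicity of PC). (i) $\lfloor f_i' S_1 \rfloor b_1 \geq o_i$
is anti-monotone in $\Gamma'$ and $\lfloor f_i' S_2 \rfloor b_2 \geq o_i$
is monotone in $\Gamma'$. (ii) If Equation (16) holds, $PC(x_i)$ is
monotone in $\Gamma'(x_i)$, and if Equation (16) fails, $PC(x_i)$ is
anti-monotone in $\Gamma'(x_i)$.

Corollary 3. $H(PC(x_i), \Gamma')$ consists of
$H(\lfloor f_i' S_1 \rfloor b_1 \geq o_i, \Gamma')$,
$H(PC(x_i), \Gamma'(x_i))$, and
$H(\lfloor f_i' S_2 \rfloor b_2 \geq o_i, \Gamma')$.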

5.1.4 Algorithms 

The next theorem gives a computation of all pairs in $\Gamma'$
satisfying both PC and FC, i.e., all valid pairs in $\Gamma'$.

Theorem 4 (Computing all valid pairs in $\Gamma'$). Let $\Gamma^*$
be the intersection of $H(FC(S_1), \Gamma')$, $H(FC(S_2), \Gamma')$, and
$\cap_i H(PC(x_i), \Gamma')$. (i) $\Gamma^*$ contains exactly the valid
pairs in $\Gamma'$. (ii) The first pair in $\Gamma^*$ (if any) is the
optimal pair wrt $(S_1, S_2)$. (iii) $\Gamma^*$ can be computed in
$O(m \log |T|)$ time and $O(m)$ space.

Proof. (i) follows from Theorem 3. For (ii), Lemma 5 implies that the
first pair in $\Gamma^*$ has the minimum loss wrt $(S_1, S_2)$. To see
(iii), the monotonicities in Lemma 7, Lemma 8, and Corollary 3
imply that each sublist involved in computing $\Gamma^*$ can be
found by a binary search over $\Gamma'$, which takes $O(m \log |T|)$
time in total (note that the length $k'$ of $\Gamma'$ is no more than
$|T|$). Note that intersecting two sublists takes $O(1)$ time. The
$O(m)$ space follows from the fact that each sublist is represented by
its bounding interval and any element of $\Gamma'$ examined by a binary
search can be generated based on its position without storing
the list. □
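
The binary-search step used in the proof can be sketched generically. The sketch below locates the bounding interval of the positions on which a monotone or anti-monotone condition holds, and intersects two such intervals in constant time. The function names are hypothetical, and `cond` stands for any test on a position of the pruned list (for instance, one of the fill constraints evaluated on the pair generated at that position).

```python
def first_true(cond, lo, hi):
    """Smallest i in [lo, hi] with cond(i) true, assuming cond is monotone.

    Returns hi + 1 if cond is false everywhere.
    """
    while lo <= hi:
        mid = (lo + hi) // 2
        if cond(mid):
            hi = mid - 1
        else:
            lo = mid + 1
    return lo


def holding_interval(cond, k, monotone):
    """Bounding interval (i, j) of the positions 0..k where cond holds, or None."""
    if monotone:
        i = first_true(cond, 0, k)
        return (i, k) if i <= k else None
    # Anti-monotone: cond holds on a prefix; find where it first fails.
    j = first_true(lambda p: not cond(p), 0, k) - 1
    return (0, j) if j >= 0 else None


def intersect(a, b):
    """Common sublist of two sublists given by their bounding intervals."""
    if a is None or b is None:
        return None
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None
```

Each call to `holding_interval` evaluates `cond` on a logarithmic number of positions, which is where the logarithmic factor in the running time comes from.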



Algorithm 1 Optimal Two-Size Bucketing

TwoSizeBucketing($T$, $F'$, $M$, $M'$)
Input: $T$, $1 \leq i \leq m$, $F'$, $M$, $M'$
Output: the optimal bucket setting $((S_1, b_1), (S_2, b_2))$
1: compute $o_i$, $1 \leq i \leq m$
2: $Best_{loss} \leftarrow \infty$
3: $Best_{setting} \leftarrow NULL$
4: for all $\{S_1 = M; S_1 \leq M' - 1; S_1{+}{+}\}$ do
5:     for all $\{S_2 = S_1 + 1; S_2 \leq M'; S_2{+}{+}\}$ do
6:         compute $\Gamma^*$ using Theorem 4
7:         if $\Gamma^*$ is not empty then
8:             let $(b_1, b_2)$ be the first pair in $\Gamma^*$
9:             let $B_j$ be the set of $b_j$ buckets of size $S_j$, $j = 1, 2$
10:            if $Best_{loss} > Loss(B_1 \cup B_2)$ then
11:                $Best_{setting} \leftarrow ((S_1, b_1), (S_2, b_2))$
12:                $Best_{loss} \leftarrow Loss(B_1 \cup B_2)$
13: return $Best_{setting}$



Algorithm 1 presents the algorithm for finding the optimal
two-size bucket setting based on Theorem 4, TwoSizeBucketing.
The input consists of a table $T$, a privacy parameter
$F'$, and the minimum and maximum bucket sizes $M$ and $M'$.
Line 1 computes $o_i$ in one scan of $T$. Lines 2 and 3 initialize
$Best_{loss}$ and $Best_{setting}$. Lines 4 and 5 iterate through
all pairs $(S_1, S_2)$ with $M \leq S_1 < S_2 \leq M'$. For each pair
$(S_1, S_2)$, Line 6 computes the list $\Gamma^*$ using Theorem 4. Lines
8-12 compute Loss of the first pair in $\Gamma^*$ and update
$Best_{loss}$ and $Best_{setting}$ if necessary. Line 13 returns
$Best_{setting}$. The algorithm uses both loss-based pruning and
privacy-based pruning. The former is through the prefix $\Gamma'$
obtained by the upper bound $Best_{loss}$ as computed in Lemma 6, and
the latter is through the binary search of valid pairs implicit in
the computation of $\Gamma^*$. To tighten up $Best_{loss}$, Lines 4 and
5 examine smaller sizes $(S_1, S_2)$ before larger ones.
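
For readers who want to experiment, the following is a simplified sketch of the search loop of Algorithm 1. It is not the algorithm as published: it scans each list of feasible pairs sequentially with loss-based pruning only, instead of using the binary search of Theorem 4, and it takes per-value thresholds and counts as plain lists. All names are assumptions for illustration.

```python
from math import ceil, floor, gcd, inf


def is_valid(f_prime, counts, s1, b1, s2, b2):
    """PC and FC of Theorem 3; CC is assumed to hold by construction."""
    a1 = [min(floor(fp * s1) * b1, o) for fp, o in zip(f_prime, counts)]
    a2 = [min(floor(fp * s2) * b2, o) for fp, o in zip(f_prime, counts)]
    pc = all(x + y >= o for x, y, o in zip(a1, a2, counts))
    return pc and sum(a1) >= s1 * b1 and sum(a2) >= s2 * b2


def two_size_bucketing(f_prime, counts, max_size):
    """Return (best_setting, best_loss); best_setting is None if nothing is valid."""
    total = sum(counts)
    min_size = min(ceil(1 / fp) for fp in f_prime)          # M
    best_loss, best_setting = inf, None
    for s1 in range(min_size, max_size):                     # Line 4
        for s2 in range(s1 + 1, max_size + 1):               # Line 5
            lcm = s1 * s2 // gcd(s1, s2)
            d1, d2 = lcm // s1, lcm // s2
            # First feasible pair: smallest b2 with s1*b1 + s2*b2 == total.
            b2 = next((b for b in range(d2)
                       if total >= s2 * b and (total - s2 * b) % s1 == 0), None)
            if b2 is None:
                continue                                     # no feasible pair
            b1 = (total - s2 * b2) // s1
            while b1 >= 0:
                cur = b1 * (s1 - 1) ** 2 + b2 * (s2 - 1) ** 2
                if cur >= best_loss:
                    break                                    # later pairs only lose more
                if is_valid(f_prime, counts, s1, b1, s2, b2):
                    best_loss, best_setting = cur, ((s1, b1), (s2, b2))
                    break                                    # first valid pair is optimal
                b1, b2 = b1 - d1, b2 + d2
    return best_setting, best_loss
```

Because the loss increases along each list, stopping at the first valid pair gives the same result as the published algorithm for that pair of sizes; only the running time differs, since the scan is linear rather than logarithmic in the list length.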

5.2 Algorithms for Multi-Size Bucket Settings 

A natural next step is to extend the solution for the two-size
problem to the multi-size problem. To do so, we must
extend Theorem 3 to validate a three-size bucket setting. The
next example shows that this does not work.

Example 5. Let $|B_1| = |B_2| = 20$, $|B_3| = 30$, and $|T| = 70$.
There are 11 values $x_1, \ldots, x_{11}$: $o_i = 5$ for
$1 \leq i \leq 10$, and $o_{11} = 20$. Suppose that for
$1 \leq i \leq 10$, $a_{i1} = a_{i2} = 0$, $a_{i3} = 5$, and
$a_{11,1} = a_{11,2} = a_{11,3} = 20$. The following
extended version of PC, FC and CC in Theorem 3 holds:
$\forall i : a_{i1} + a_{i2} + a_{i3} \geq o_i$; for $j = 1, 2, 3$,
$\sum_i a_{ij} \geq |B_j|$; $|T| = |B_1| + |B_2| + |B_3|$. However,
there is no valid record assignment to these buckets. Note that, for
$1 \leq i \leq 10$, since $a_{i1} = a_{i2} = 0$, none of the records
for $x_i$ can be assigned to the buckets for $B_1$ or $B_2$. So the 50
records for $x_i$, $1 \leq i \leq 10$, must be assigned to
the buckets for $B_3$, but $B_3$ has a capacity of 30.
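
The arithmetic of this counterexample can be checked directly. The short script below is only an illustration; the variable names are arbitrary, and the bucket-set capacities are taken from the example.

```python
# Occurrence counts o_i for x_1..x_11 and the per-bucket-set upper bounds a_ij.
counts = [5] * 10 + [20]
bounds = [[0, 0, 5]] * 10 + [[20, 20, 20]]
capacity = [20, 20, 30]

# The three extended conditions all hold.
pc = all(sum(row) >= o for row, o in zip(bounds, counts))
fc = all(sum(row[j] for row in bounds) >= capacity[j] for j in range(3))
cc = sum(counts) == sum(capacity)

# Records that cannot go anywhere except the third bucket set.
forced_into_b3 = sum(o for row, o in zip(bounds, counts)
                     if row[0] == 0 and row[1] == 0)

print(pc, fc, cc, forced_into_b3, capacity[2])
```

The three flags are all true, yet the number of records forced into the third bucket set exceeds its capacity, which is why the pairwise conditions are no longer sufficient with three sizes.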

Our solution is recursively applying TwoSizeBucketing to
reduce Loss. This algorithm, MultiSizeBucketing, is given in
Algorithm 2. The input consists of $T$, a set of records, $B$,
a set of buckets of the same size, and $F'$, $M$, $M'$ as usual,
where $|T| = s(B)$. The algorithm applies TwoSizeBucketing
to find the optimal two-size bucket setting $(B_1, B_2)$ for
$T$ (Line 1). If $Loss(B_1 \cup B_2) < Loss(B)$, Line 3 partitions
the records of $T$ into $T_1$ and $T_2$ between $B_1$ and $B_2$.



Algorithm 2 Heuristic Multi-Size Bucketing

MultiSizeBucketing($T$, $B$, $F'$, $M$, $M'$)
Input: $T$, $B$, $F'$, $M$, $M'$
Output: a bucket setting $(B_1, \ldots, B_q)$ and $T_1, \ldots, T_q$, where
1: $(B_1, B_2) \leftarrow$ TwoSizeBucketing($T$, $F'$, $M$, $M'$)
2: if $Loss(B_1 \cup B_2) < Loss(B)$ then
3:     $(T_1, T_2) \leftarrow$ RecordPartition($T$, $B_1$, $B_2$) (Section 4.3)
4:     MultiSizeBucketing($T_1$, $B_1$, $F'$, $M$, $M'$)
5:     MultiSizeBucketing($T_2$, $B_2$, $F'$, $M$, $M'$)
6: else
7:     return($T$, $B$)



RecordPartition($T$, $B_1$, $B_2$) is the record partition procedure
discussed in Section 4.3. Lines 4 and 5 recur on each of
$(T_1, B_1)$ and $(T_2, B_2)$. If $Loss(B_1 \cup B_2) \geq Loss(B)$,
Line 7 returns the current bucket setting $B$ for $T$.

6. ADDITIONAL AUXILIARY INFORMATION 

Dealing with an adversary armed with additional auxiliary 
information is one of the hardest problems in data privacy. 
As pointed out by [l^, there is no free lunch in data pri- 
vacy. Thus, instead of dealing with all types of auxiliary 
information, we consider two previously identified attacks, 
namely, corruption attack [21] and negative association attack
[13, 17]. To focus on the main idea, we consider $F'$-privacy
such that $f_i'$ is the same for all $x_i$'s. In this case,
$F'$-privacy degenerates into $\ell$-diversity with
$\ell = \lceil 1/f_i' \rceil$ and the solution in Section 5.1 returns
buckets of size $S_1 = \ell$ or $S_2 = \ell + 1$, and each record in a
bucket has a distinct SA value.



In the corruption attack, an adversary has acquired from an
external source the SA value $x_i$ of some record $r$ in the data.
$r$ is called a corrupted record. Armed with this knowledge, the
adversary will boost the accuracy of inference by excluding
one occurrence of $x_i$ when inferring the sensitive value of the
remaining records that share the same bucket with $r$. To
combat the accuracy boosting, we propose to inject some
small number $\sigma$ of fake SA values into each bucket $g$, where
a fake value does not actually belong to any record in the
bucket. To ensure that the adversary cannot distinguish a
fake value from a real value, a fake value must be from the
domain of SA and must be distinct in the bucket. Now, for
each bucket $g$, the table QIT contains $|g|$ records and the
table ST contains $|g| + \sigma$ distinct SA values, in a random
order. The adversary knows that $\sigma$ of these SA values are fake
but does not know which ones.

Suppose now that in a corruption attack, the adversary is
able to corrupt $q$ records in a bucket $g$, where $q < |g|$, so
$|g| - q + \sigma$ values remain in $g$, $\sigma$ of which are fake.
Note that $|g|$ and $\sigma$ are constants. Therefore, the more
records the adversary is able to corrupt (i.e., a larger $q$), the
larger the proportion of fake values among the remaining records in
the bucket (i.e., $\frac{\sigma}{|g| - q + \sigma}$) and the more
uncertain the adversary is about whether a remaining value in $g$ is
a real value or a fake value. Even if all but one record in a bucket
are corrupted, the adversary has only $1/(1 + \sigma)$ certainty that
a remaining value is a real value. The price to pay for this
additional protection is the distortion by the $\sigma$ fake values
added to each bucket.
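
The effect of the injected fake values can be illustrated numerically. In the sketch below, `bucket_size`, `corrupted` and `fakes` are hypothetical parameter names for the bucket size, the number of corrupted records, and the number of fake values per bucket; exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def fake_proportion(bucket_size, corrupted, fakes):
    """Share of fake values among the values left after excluding corrupted ones."""
    return Fraction(fakes, bucket_size - corrupted + fakes)


def real_certainty(bucket_size, corrupted, fakes):
    """Chance that a remaining value is real, from the adversary's viewpoint."""
    return 1 - fake_proportion(bucket_size, corrupted, fakes)


# A bucket of 5 records with 2 fake values: corrupting more records
# raises the share of fake values among what remains.
for q in range(5):
    print(q, fake_proportion(5, q, 2), real_certainty(5, q, 2))
```

When all but one record of the bucket are corrupted, `real_certainty` reduces to one over one plus the number of fake values, matching the bound stated above.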



The study in [13, 17] shows that under unusual circumstances
a negative association between a non-sensitive value
$z$ and a SA value $x$ may be learnt from the published data
$T^*$, which states that a record having $z$ is less likely to have
$x$. Using such a negative association, an adversary could exclude
unlikely choices $x$ when inferring the sensitive value for
an individual having the non-sensitive value $z$. Since this
attack shares the same mechanism as the corruption attack,
i.e., excluding unlikely values, the above solution proposed
for the corruption attack can be applied to deter the negative
association attack, with one difference: a fake value should not
be easily excluded for any record using the negative association
knowledge. To ensure this, the publisher can first learn
the negative association from $T^*$ and inject only those fake
values into a bucket that cannot be removed using the learnt
negative association.



Parameters                      Settings
Cardinality $|T|$               100k, 200k, 300k, 400k, 500k
$f_i'$-privacy for $x_i$        $f_i' = \min\{1, \theta \times f_i + 0.02\}$
Privacy coefficient $\theta$    2, 4, 8, 16, 32
$M$                             $\min_i\{\lceil 1/f_i' \rceil\}$
$M'$                            50

Table 4: Parameter settings



7. EMPIRICAL STUDIES 

We evaluate the effectiveness and efficiency of the algorithms
proposed in Section 5. For this purpose, we utilized
the real data set CENSUS containing personal information of
500K American adults. This data set was previously used in
[22], [15] and [19]. Table 5 shows the eight discrete attributes
of the data. Two base tables were generated from CENSUS.
The first table OCC has Occupation as SA and the 7 remaining
attributes as the QI-attributes. The second table EDU
has Education as SA and the 7 remaining attributes as the
QI-attributes. OCC-n and EDU-n denote the data sets of
OCC and EDU of the cardinality n. Figure 2 shows the frequency
distribution of SA. The parameters and settings are
summarized in Table 4 with the default setting in bold face.
Figure 2: Frequency distribution of SA

We evaluate our algorithms by three criteria: suitability 
for handling varied sensitivity, data utility, and scalability. 

7.1 Criterion 1: Handling Variable Sensitivity 

Our first objective is to study the suitability of $F'$-privacy
for handling variable sensitivity and skewed distribution of
sensitive values. For concreteness, we specify $F'$-privacy by
$f_i' = \min\{1, \theta \times f_i + 0.02\}$, where $\theta$ is the
privacy coefficient chosen from $\{2, 4, 8, 16, 32\}$. This
specification models a linear relationship between the sensitivity
$f_i'$ and the frequency $f_i$ for $x_i$. Since $f_i' \geq f_i$ for all
$x_i$'s, a solution satisfying $F'$-privacy always exists (Lemma 1).
In fact, a solution exists even with the maximum bucket size
constraint $M' = 50$.

Attribute      Domain Size
Age            76
Gender         2
Education      14
Marital        6
Race           9
Work-Class     10
Country        83
Occupation     50

Table 5: Statistics of CENSUS

For comparison purposes, we apply $\ell$-diversity to model the
above $F'$-privacy, where $\ell$ is set to
$\lceil 1/\min_i f_i' \rceil$ (Remark 1).
For the OCC-300K and EDU-300K data sets, which have the
minimum $f_i$ of 0.18% and 0.44%, respectively, Figure 3 plots
the relationship between $\theta$ and $\ell$. Except for $\theta = 32$,
a rather large $\ell$ is required to enforce $F'$-privacy. As such, the
buckets produced by Anatomy [22] have a large size $\ell$ or
$\ell + 1$, thus a large Loss. A large $\ell$ also renders
$\ell$-diversity too restrictive. As discussed in Remark 1,
$1/\ell \geq \max_i f_i$ is necessary for having an $\ell$-diversity
solution. With OCC-300K's maximum $f_i$ being 7.5% and EDU-300K's
maximum being 27.3%, this condition is violated for all
$\ell \geq 14$ in the case of OCC-300K and all $\ell \geq 4$ in the
case of EDU-300K, thus for most $F'$-privacy considered. This study
suggests that $\ell$-diversity is not suitable for handling sensitive
values of varied sensitivity and skewed distribution.




Figure 3: The relationship between $\ell$ (y-axis) and
privacy coefficient $\theta$ (x-axis). Panels: (a) OCC, (b) EDU



7.2 Criterion 2: Data Utility 

Our second objective is to evaluate the utility of $T^*$. We
consider two utility metrics, Mean Squared Error (MSE)
(Definition 3) and Relative Error (RE) for count queries previously
used in [22]. We compare TwoSizeBucketing, denoted
by "TwoSize", and MultiSizeBucketing, denoted by "MultiSize",
against two other methods. (i) Optimal multi-size
bucketing, denoted by "Optimal", is the exact solution to the
optimal multi-size bucket setting problem, solved by an integer
linear program. "Optimal" provides the theoretical lower
bound on Loss, but it is feasible only for a small domain
size $|SA|$. (ii) Anatomy with $\ell$-diversity, where $\ell$ is set to
$\ell = \lceil 1/\min_i f_i' \rceil$. Except for "Anatomy", the minimum
bucket size $M$ is set to $\min_i\{\lceil 1/f_i' \rceil\}$ and the
maximum bucket size $M'$ is set to 50.

7.2.1 Mean Squared Error (MSE)

Figure 4 shows MSE vs the privacy coefficient $\theta$ on the
default OCC-300K and EDU-300K. The study in Section 7.1
shows that for most $F'$-privacy considered the corresponding
$\ell$-diversity cannot be achieved on the OCC and EDU data
sets. For comparison purposes, we compute the MSE for
"Anatomy" based on the bucket size of $\ell$ or $\ell + 1$ while
ignoring the privacy constraint. "Anatomy" has a significantly higher
MSE than all other methods across all settings of $\theta$ because
the bucket sizes $\ell$ and $\ell + 1$ are large. "TwoSize" has only
a slightly higher MSE than "MultiSize", which has only a
slightly higher MSE than "Optimal". This study suggests
that the restriction to the two-size bucketing problem causes
only a small loss of optimality and that the heuristic solution
is a good approximation to the optimal solution of the multi-size
bucket setting problem.
Figure 4: MSE (y-axis) vs privacy coefficient $\theta$ (x-axis)



7.2.2 Relative Error (RE) 



We adapt count queries $Q$ of the following form from [22]:

SELECT COUNT(*) FROM T
WHERE pred($A_1$) AND ... AND pred($A_{qd}$) AND pred($SA$)

$A_1, \ldots, A_{qd}$ are randomly selected QI-attributes. $qd$ is the
query dimensionality and is randomly selected from $\{1, \ldots, 7\}$
with equal probability, where 7 is the total number of QI-attributes.
For any attribute $A$, pred($A$) has the form

$A = a_1$ OR ... OR $A = a_b$,

where $a_i$ is a random value from the domain of $A$. As in
[22], the value of $b$ depends on the expected query selectivity,
which was set to 1% here. The details can be found in [22].
The answer $act$ to $Q$ using $T$ is the number of records in $T$
that satisfy the condition in the WHERE clause. We created
a pool of 5,000 count queries of the above form. For each
query $Q$ in the pool, we compute the estimated answer $est$
using $T^*$ in the same way as in [22]. The relative error (RE)
on $Q$ is defined to be $RE = |act - est|/act$. We report the
average RE over all queries in the pool.
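
The reported metric can be sketched in a few lines. The function names are hypothetical, and the sketch assumes that every query in the pool has a non-zero actual answer, since the relative error is undefined otherwise; how such queries were treated in the experiments is not stated here.

```python
def relative_error(act, est):
    """Relative error |act - est| / act of one count query."""
    if act == 0:
        raise ValueError("relative error is undefined for a zero actual answer")
    return abs(act - est) / act


def average_relative_error(answers):
    """Mean relative error over (act, est) pairs, one pair per query."""
    errors = [relative_error(act, est) for act, est in answers]
    return sum(errors) / len(errors)
```

For instance, a pool with actual answers 100 and 50 and estimates 90 and 55 has an average relative error of 10%.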

Figure 5 shows RE vs the privacy coefficient $\theta$ on the
default OCC-300K and EDU-300K. For the OCC data set, the
maximum RE is slightly over 10%. The REs for "TwoSize",
"MultiSize", and "Optimal" are relatively close to each other,
which is consistent with the earlier finding on similar MSE
for these algorithms. For the EDU data set, all REs are
no more than 10%. "MultiSize" improves upon "TwoSize"
by about 2%, and "Optimal" improves upon "MultiSize" by
about 2%. This study suggests that the solutions of the optimal
two-size bucketing and the heuristic multi-size bucketing
are highly accurate for answering count queries, with the RE
below 10% for most $F'$-privacy considered. "Anatomy" was
not included since there is no corresponding $\ell$-diversity
solution for most $F'$-privacy considered (see Section 7.1).
Figure 5: Relative Error (%) (y-axis) vs privacy coefficient
$\theta$ (x-axis)

7.3 Criterion 3: Scalability 

Lastly, we evaluate the scalability for handling large data
sets. We focus on TwoSizeBucketing because it is a key component
of MultiSizeBucketing. "No-pruning" refers to the sequential
search of the full list $\Gamma$ without any pruning;
"Loss-pruning" refers to the loss-based pruning in Section 5.1.2;
"Full-pruning" refers to TwoSizeBucketing in Section 5.1.3,
which exploits both loss-based pruning and privacy-based
pruning. "Optimal" refers to the integer linear program solution
to the two-size bucketing problem. We study the Runtime
with respect to the cardinality $|T|$ and the domain size
$|SA|$. The default privacy coefficient setting $\theta = 8$ is used.
All algorithms were implemented in C++ and run on a 64-bit
Windows platform with a 2.53 GHz CPU and 12GB of memory.
Each algorithm was run 100 times and the average time is
reported here.

7.3.1 Scalability with $|T|$

Figure 6 shows Runtime vs the cardinality $|T|$. "Full-pruning"
takes the least time and "No-pruning" takes the most
time. "Loss-pruning" significantly reduces the time compared
to "No-pruning", but has an increasing trend in Runtime as
$|T|$ increases because of the sequential search of the first valid
pair in the list $\Gamma'$. In contrast, a larger $|T|$ does not affect
"Full-pruning" much because "Full-pruning" locates the first
valid pair by a binary search over $\Gamma'$. "Optimal" takes less
time than "No-pruning" because the domain size $|SA|$ is relatively
small. The next experiment shows that the comparison
is reversed for a large domain size $|SA|$.

7.3.2 Scalability with $|SA|$

We scale up $|SA|$ for OCC-500K and EDU-500K by a factor
$\gamma$, where $\gamma$ ranges over 2, 4, 8, 16, 32 and 64. Assume
that the domain of SA has the form $\{0, 1, \ldots, m - 1\}$.
For each record $t$ in $T$, we replace $t[SA]$ in $t$ with the value
$\gamma \times t[SA] + r$, where $r$ is an integer selected randomly
from the range $[0, \gamma - 1]$ with equal probability. Thus the new
domain of SA has the size $m \times \gamma$. Figure 7 shows Runtime vs
the scale-up factor $\gamma$. As $\gamma$ increases, Runtime of
"Optimal" increases quickly because the integer linear programming is
exponential in the domain size $|SA|$. Runtime of the other
algorithms increases little because the complexity of these
algorithms is linear in the domain size $|SA|$. Interestingly, as
$|SA|$ increases, Runtime of "No-pruning" decreases. A close
look reveals that when there are more SA values, $f_i$ and $f_i'$
become smaller and the minimum bucket size $M$ becomes
larger, which leads to a shorter $\Gamma$ list. A shorter $\Gamma$ list
benefits most the sequential search based "No-pruning".

Figure 6: Runtime (seconds) (y-axis) vs cardinality $|T|$ (x-axis)

Figure 7: Runtime (seconds) (y-axis) vs scale-up factor $\gamma$ for
$|SA|$ (x-axis)
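
The domain scale-up used in this experiment can be sketched as follows. The function name and the use of a seeded generator are assumptions made so that the illustration is repeatable; the transformation itself is the one described above, applied to a list of integer-coded SA values.

```python
import random


def scale_up_sa(sa_values, gamma, rng):
    """Replace each SA code v by gamma * v + r, with r uniform in [0, gamma - 1]."""
    return [gamma * v + rng.randrange(gamma) for v in sa_values]


rng = random.Random(0)                 # seeded only to make the sketch repeatable
original = [0, 1, 2, 2, 1]             # SA codes from a domain of size m = 3
scaled = scale_up_sa(original, 4, rng)
# Every new code lies in a domain of size m * gamma = 12, and integer
# division by gamma recovers the original code.
print(scaled, [s // 4 for s in scaled])
```

Each original value is thus split into a block of new values, which lowers the frequency of every individual SA value as the factor grows.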

In summary, we showed that the proposed methods can 
better handle sensitive values of varied sensitivity and skewed 
distribution, therefore, retain more information in the data, 
and the solution is scalable for large data sets. 

8. CONCLUSION 

Although differential privacy has many nice properties, it 
does not address the concern of inferential privacy, which 
arises due to the wide use of statistical inferences in advanced 
applications. On the other hand, previous approaches to in- 
ferential privacy suffered from major limitations, namely, lack 
of flexibility in handling varied sensitivity, poor utility, and 
vulnerability to auxiliary information. This paper developed 
a novel solution to overcome these limitations. Extensive ex- 
perimental results confirmed the suitability of the proposed 
solution for handling sensitive values of varied sensitivity and 
skewed distribution. 